Convergence Analysis of Distributed Stochastic Gradient Descent with Shuffling
نویسندگان
چکیده
When using stochastic gradient descent (SGD) to solve large-scale machine learning problems, a common practice of data processing is to shuffle the training data, partition the data across multiple threads/machines if needed, and then perform several epochs of training on the re-shuffled (either locally or globally) data. The above procedure makes the instances used to compute the gradients no longer independently sampled from the training data set, which contradicts with the basic assumptions of conventional convergence analysis of SGD. Then does the distributed SGD method have desirable convergence properties in this practical situation? In this paper, we give answers to this question. First, we give a mathematical formulation for the practical data processing procedure in distributed machine learning, which we call (data partition with) global/local shuffling. We observe that global shuffling is equivalent to without-replacement sampling if the shuffling operations are independent. We prove that SGD with global shuffling has convergence guarantee in both convex and non-convex cases. An interesting finding is that, the non-convex tasks like deep learning are more suitable to apply shuffling comparing to the convex tasks. Second, we conduct the convergence analysis for SGD with local shuffling. The convergence rate for local shuffling is slower than that for global shuffling, since it will lose some information if there’s no communication between partitioned data. Finally, we consider the situation when the permutation after shuffling is not uniformly distributed (We call it insufficient shuffling), and discuss the condition under which this insufficiency will not influence the convergence rate. Our theoretical results provide important insights to large-scale machine learning, especially in the selection of data processing methods in order to achieve faster convergence and good speedup. Our theoretical findings are verified by extensive experiments on logistic regression and deep neural networks.
منابع مشابه
Asynchronous Stochastic Gradient Descent with Variance Reduction for Non-Convex Optimization
We provide the first theoretical analysis on the convergence rate of the asynchronous stochastic variance reduced gradient (SVRG) descent algorithm on nonconvex optimization. Recent studies have shown that the asynchronous stochastic gradient descent (SGD) based algorithms with variance reduction converge with a linear convergent rate on convex problems. However, there is no work to analyze asy...
متن کاملFast Convergence for Stochastic and Distributed Gradient Descent in the Interpolation Limit
Modern supervised learning techniques, particularly those using so called deep nets, involve fitting high dimensional labelled data sets with functions containing very large numbers of parameters. Much of this work is empirical, and interesting phenomena have been observed that require theoretical explanations, however the non-convexity of the loss functions complicates the analysis. Recently i...
متن کاملThe Convergence of Stochastic Gradient Descent in Asynchronous Shared Memory
Stochastic Gradient Descent (SGD) is a fundamental algorithm in machine learning, representing the optimization backbone for training several classic models, from regression to neural networks. Given the recent practical focus on distributed machine learning, significant work has been dedicated to the convergence properties of this algorithm under the inconsistent and noisy updates arising from...
متن کاملVariance Reduction for Distributed Stochastic Gradient Descent
Variance reduction (VR) methods boost the performance of stochastic gradient descent (SGD) by enabling the use of larger, constant stepsizes and preserving linear convergence rates. However, current variance reduced SGD methods require either high memory usage or an exact gradient computation (using the entire dataset) at the end of each epoch. This limits the use of VR methods in practical dis...
متن کاملBrief Announcement: Byzantine-Tolerant Machine Learning
We report on Krum, the rst provably Byzantine-tolerant aggregation rule for distributed Stochastic Gradient Descent (SGD). Krum guarantees the convergence of SGD even in a distributed setting where (asymptotically) up to half of the workers can be malicious adversaries trying to attack the learning system.
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1709.10432 شماره
صفحات -
تاریخ انتشار 2017